The language is not the point

Published: 2025-11-21

Introduction

This post is decidedly non-quantitative (of course, this is par for the course with any blog post about this topic) -- vibes only! In the future, I hope to remedy this as part of a research program.

I'm creating a programming language, primarily by driving LLM agents. The overwhelming consensus is that such an endeavor is foolhardy. Protests of "these systems will pull the wool over your eyes!" and "they can't do that" are echoing through the halls of the internet as I type these words.

But I have a cheap trick up my sleeve: I've divested myself from the outcome. I'm a 5th year PhD student with a gung ho, risky attitude. It's gotten me in trouble before, and it's not an attribute that makes for a successful and well-known researcher. I do stupid things, and I spend much too long on them. But I think I've resigned myself to my instincts, and I'd prefer to do something I want to do rather than something I have to do.

In this case, however, I believe this exercise has a point. A programming language (along with its compiler and runtime system) is a complex computational artifact, integrating several layers of careful logic. I'm a programming languages researcher, and, inspired by Terence Tao's recent meditations on using LLMs to accelerate his mathematical inquiries, I'm intensely curious: can these internet-scale amortized distributions multiply my own abilities (which are decidedly less than Tao's)?

Put less self-centeredly: can these artifacts be effectively coerced into producing sound code that satisfies a complex computational design?

If the answer is yes, and there are repeatable strategies for exercising the yes -- that's a worthwhile endeavor. This is a question about pushing these systems to their limits. As far as I can tell, OpenAI and Anthropic and Google are focusing their user-facing marketing on fucking CRUD apps. I don't give a flying fuck about CRUD apps. Stop infantilizing me with your garbage. I want a sharp fucking tool that makes my processor bleed. Can you help me craft such a tool or not?

The blind leading the blind

What I'm not telling you is that I've already spent on the order of 6-8 months using agents to do various things for me. To wit:

Claude Code was my introduction to these systems. Before Claude Code, I was a naysayer. Then I went on vacation with my family during a technical paper push, had to crank out several Bayesian inference experiments to illustrate the features of a system I was developing, and became a convert.

It was night and day for me. In the dark of night, I didn't believe that LLMs would lead to useful tools -- I didn't pay attention to them, I worked on my work, I was happy. Then, as I watched Claude Code write a "robot in a world" simulator as a probabilistic program in the probabilistic programming language we were developing, I had a bit of a spiritual awakening. I remember the days that followed only as a haze of adrenaline, frantic text messages to some of my more patient collaborators, and frenzied pseudo-crackpot discussions with my wife and family.

My collaborators and I quickly ran into the limitations of Claude Code, especially when used within an organization. Here's the central problem: every human on a team needs a theory of the codebase to work together effectively. These tools can significantly damage your theory-building abilities. If you crave automation, and you give in to your demons -- you're shooting yourself in the foot. It's very obvious that this little problem hasn't been figured out yet.

To combat this in an organizational setting, one needs to adopt a philosophy of small, modular PRs. You can drive the agent to write code, but you need to be able to consume it. Reviewing, as many internet commentators have pointed out, becomes the bottleneck -- and now doubly so, because you didn't write the code in the first place ... you're behind on your theory building.

For now, at least, greenfield solo projects are "safer".

I quickly moved on to using both Codex and Gemini via the VSCode Copilot interface. Each of these tools has a "flavor", and one develops an intuitive theory of mind for each of them (which, of course, has to change with every fucking update to the agentic harness or model).

Here are some of my abbreviated takeaways (including from more recent models), to convey some level of expertise to those who have wasted as much time as I have on these distributions:

The clear empirical answer for me is that there is value in these artifacts. But there are also obvious and significant limits as soon as you start trying to use these tools for serious business. If you take Claude Code out of the box, for instance, and start driving it to implement a language, you're quickly going to run into its weaknesses -- meaning the properties of the distribution over tokens arising from the (tuned model + agentic harness) which are adversarial to your goals, as expressed in your context window with the agent.

A conjecture about the difficulty of context engineering

How exactly does one use agents to do something complicated? I'd argue that the following are the fundamental weaknesses for agents today -- thinking about them as (tuned models + agentic harnesses):

Therefore, trivially (you didn't need me to say this) -- using agents to make something complex scales poorly with complexity. You're fighting against the complexity to get into a "golden" part of the agent's distribution, and it becomes harder and harder to do this as the codebase or task becomes more complex.

Every new agentic IDE is attempting to tame this problem. Whether it's Kiro's "spec-driven development", or Antigravity's agent manager mode, or the internally hidden logic of Claude Code ... the "features" of these IDEs are attempts to confront this issue.

And then you see a video about using such an IDE to make a fucking webpage, and you mentally prepare to deallocate yourself, harakiri style, to combat the dishonor of cooking a thousand GPUs for that. Can we be serious?

Disdain aside, the teams behind these systems have made the observation above, and it's a true one: it's what we need to contend with if we want to get these things to summon a complex computational artifact.

Summoning

I've been experimenting with multi-agent orchestration patterns myself. Locally sourced, small batch terminal orchestration ... by hand.

Worse, it seems to be working ... at least, it seems to have increased the complexity limit on the computational artifact that I can effectively get agents to work on.

I no longer use a single agent; I always use combinations of agents. Why? Because it seems (we're in vibe land now!) like things go better when I keep agents in specific roles, and play them off against each other.

Somewhere ~40% into the context window for certain roles, magic starts to happen. Red-team Codex starts nailing Claude to the wall using our spec as a crucifix. Claude squeals uncontrollably: "You're absolutely right! Please, father, let me try again?" This goes on, and work gets done -- syntax gets parsed correctly, lowered, executed. Holes are filled -- holes that only Codex can see (for I am blind, I have willingly blinded myself so that I cannot bear witness to my own sins).

Now, such debauchery can go off the rails: that's why I need to keep Codex honest -- "what is the discrepancy between our implementation and the spec?" or "can we really handle this feature?" My prodding often finds gaps in understanding: if the gaps are too large, I must delete entire sequences of commits and try again. But code is cheap at this party, and flows like wine. Deletion, followed by a resampling ... is often more effective than incremental change. (Remember, greenfield I said!)
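To make the shape of this loop concrete, here's a minimal sketch -- not my actual tooling. Everything in it is a stand-in: `builder-agent` and `redteam-agent` are hypothetical CLI commands (in reality this is me, by hand, across terminal sessions), `SPEC.md` and the `last-known-good` ref are placeholders, and the automated stopping check is a poor proxy for human judgment.

```python
import subprocess

# Hypothetical: the spec document that anchors both roles.
SPEC = open("SPEC.md").read()


def run_agent(cmd: list[str], prompt: str) -> str:
    """Pipe a prompt to an agent CLI and return its reply.

    Placeholder: substitute whatever invocation your agent tooling
    actually accepts. The real workflow is interactive terminals.
    """
    result = subprocess.run(cmd, input=prompt, capture_output=True, text=True)
    return result.stdout


def git(*args: str) -> None:
    subprocess.run(["git", *args], check=True)


def one_round(task: str) -> bool:
    # Builder role: writes code against the spec, nothing more.
    run_agent(["builder-agent"], f"Implement: {task}\n\nSpec:\n{SPEC}")
    git("add", "-A")
    git("commit", "-m", f"wip: {task}")

    # Red-team role: audits the work against the spec, nothing more.
    report = run_agent(
        ["redteam-agent"],
        "What is the discrepancy between our implementation and the spec? "
        f"Can we really handle this feature?\n\nSpec:\n{SPEC}",
    )
    print(report)
    # A stand-in for the human reading the report and deciding.
    return "no discrepancy" in report.lower()


# Deletion + resampling: if the gaps are too large, throw away the
# wip commits and redraw from the distribution rather than patching.
if not one_round("parse, lower, and execute the new syntax"):
    git("reset", "--hard", "last-known-good")  # hypothetical ref
```

The point is the structure, not the automation: one role writes, a different role audits against the spec, and the human arbitrates -- keeping, iterating, or deleting and resampling.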

There's a visceral experience of finding a "golden context stream" through the system -- where you've got your documents in order, the randomness has aligned correctly, the attention heads are absolutely beaming, and you can tell that the silicon is locked in... they say that hope was the last thing remaining in Pandora's box.

I can't say with certainty today if one can engineer anything serious with this type of alchemy. It is, without question, a form of gambling -- but, as with Monte Carlo methods, I'm holding out hope that some of my investigations will lead to design insights for controlling the randomness in a way that is significantly more repeatable than my depraved ramblings letting Codex red-team Claude.

What I've found so far is that, while multi-agent strategies tend to keep the overall system (codebase and agents) aligned with the specifications that I work on, the specifications can end up describing an artifact which is inconsistent or flawed. This is not an agent problem! This is a me problem!

Unfortunately, no agent has accelerated my thinking at this level of work. The closest has been ChatGPT's "research mode" with GPT 5.1 -- but ultimately, I need to be responsible for carefully understanding whether the features of my design are going to cohere together. But this, for me, is definitely the fun and imaginative part.

Samten bardo

If you work in this way for long enough, the agents and their roles become a new sort of digital appendage: fuzzy feelers that let you grope around in darkness, illuminating different parts of a Cthulhian elephant that you've brought into being. It's all fuzzy -- and it's giving "perfectly stable to bet a large portion of the economy on". Eventually, you have to cut your eyelids open and reach your hands into the filth. Somewhere in the mass of stochastic tentacles lies a perfect codebase -- like dumb monkeys, we try to sample Shakespeare.

With that, I'll leave you with a final note: there's something bizarre that I've experienced, where failure of these multi-agent context engineering strategies indicates an issue in my design. This feels quite strange, and it involves that "intuitive theory of mind" business: one becomes accustomed to agents solving certain categories of tasks, so when they can't solve something, you end up looking closely and realizing that your design has a flaw. This experience, perhaps, has been the most surprising of all.

